ADF Check Sum Computation v1.5.3 RF

The Allotrope Data Format (ADF) [[!ADF]] consists of several APIs and taxonomies. This document constitutes the specification of the ADF Check Sum Computation API for creating hash codes on an ADF file, or parts thereof.

Disclaimer

THESE MATERIALS ARE PROVIDED "AS IS" AND ALLOTROPE EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE WARRANTIES OF NON-INFRINGEMENT, TITLE, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

This document is part of a set of specifications on the Allotrope Data Format [[!ADF]].

Introduction

An underlying design principle of the ADF check sum is that a local change (e.g., in a part of a data cube) must not require re-reading the entire file to compute the check sum.

In principle, ADF is designed to support the choice between multiple algorithms to compute the check sum. Currently, there is only one implementation, which depends on the internal representation storage as an HDF5 file. This document describes this algorithm and its application to various component parts of the ADF file.

The document is structured as follows: First, the prerequisites are defined, followed by a detailed description of the hash computation on different parts of an ADF file.

Document Conventions

Namespaces

Within this specification, the following namespace prefix bindings are used:

Prefix       Namespace
owl:         http://www.w3.org/2002/07/owl#
rdf:         http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:        http://www.w3.org/2000/01/rdf-schema#
xsd:         http://www.w3.org/2001/XMLSchema#
dct:         http://purl.org/dc/terms/
skos:        http://www.w3.org/2004/02/skos/core#
qb:          http://purl.org/linked-data/cube#
adf-dp:      http://purl.allotrope.org/ontologies/datapackage#
adf-dc:      http://purl.allotrope.org/ontologies/datacube#
adf-dc-hdf:  http://purl.allotrope.org/ontologies/datacube-to-hdf5-map#
adf-audit:   http://purl.allotrope.org/ontologies/audit#
lc-hash:     http://id.loc.gov/vocabulary/cryptographicHashFunctions/

Number Formatting

Within this document, decimal numbers will use a dot "." as the decimal mark.

In the following, we often need to represent lists of bytes. For each byte, we use two hexadecimal digits; for several bytes, we simply append those digits.

Nomenclature

We use the terms "check sum" and "hash" synonymously.

Requirements

Preliminaries

Data Types

When converting data to byte strings, different data types must be handled separately. In the rest of the document, we always specify which data type is used when an object is converted to bytes.

Numeric Data Types

The integer data types are written in two's complement representation with "big endian" byte order. The table below shows an example for each type:
Java data type   C# data type   Example value        Bytes
byte             sbyte          45                   2d
short            short          -7498                e2b6
int              int            1318336784           4e943910
long             long           -4895739457839457    ffee9b59d4b3de9f

The floating point data types (float and double) are written according to IEEE 754 with “big endian” byte order.
Java data type   C# data type   Example value   Bytes
float            float          -16e10          d21502f9
double           double         3254e43         4996cc9385c1f043
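
As an illustration of the conversion rules above, the following Java sketch produces the byte representations from the tables using java.nio.ByteBuffer, which writes in big-endian order by default. The class and method names are illustrative only and not part of the ADF API.

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // Illustrative conversion of the example values above to big-endian bytes.
    public final class BigEndianExamples {

        static byte[] intToBytes(int value) {
            return ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value).array();
        }

        static byte[] longToBytes(long value) {
            return ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN).putLong(value).array();
        }

        static byte[] floatToBytes(float value) {
            // IEEE 754 single precision, big endian
            return ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putFloat(value).array();
        }

        static byte[] doubleToBytes(double value) {
            // IEEE 754 double precision, big endian
            return ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN).putDouble(value).array();
        }

        static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(toHex(intToBytes(1318336784))); // 4e943910
            System.out.println(toHex(floatToBytes(-16e10f)));  // d21502f9
        }
    }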

Strings

We encode strings by first writing the number of characters of that string as int (see above), followed by the string itself encoded as UTF-8.

Example: "Hällo World!" (umlaut ä instead of e is intended for the example) is encoded to the bytes 0000000c48c3a46c6c6f20576f726c6421. The first four bytes, 0000000c, encode the number of characters (12), and the next 13 bytes are the UTF-8 encoding of the string. Note that the umlaut ä is encoded to 2 bytes (c3a4) after the H (48), which leads to one byte more than the length of the string.

Message Digest Algorithms

ADF uses existing message digest algorithms to convert a stream of bytes to a check sum.

Currently supported are MD2, MD5, SHA-1, SHA-256, SHA-384, and SHA-512. The check sums in ADF (in general and for the DataPackage) are by default intended to ensure the integrity of the ADF file against accidental storage or transfer errors. To achieve this with the best performance, the default algorithm is MD5. MD5 is no longer considered cryptographically secure, but it remains one of the best candidates for this intended purpose. Users can, however, choose a configuration with one of the other algorithms. The check sum service can be configured per ADF file when activating hashing for the file. For DataPackages, the algorithm can be set individually for each DataPackage file created.

When we state that data is added to a message digest, we mean that the data is converted to bytes (by the rules above), and then the resulting bytes are sequentially added to the digest algorithm. This must yield the same result as constructing an array of bytes that is filled by the outcomes of the conversions and then running the message digest algorithm on it.
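
The following Java snippet illustrates this equivalence with java.security.MessageDigest; the input values are arbitrary example data, and SHA-256 is one of the supported algorithms.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;

    public final class DigestEquivalenceExample {
        public static void main(String[] args) throws NoSuchAlgorithmException {
            byte[] part1 = "first".getBytes(StandardCharsets.UTF_8);
            byte[] part2 = "second".getBytes(StandardCharsets.UTF_8);

            // Sequential updates of the digest
            MessageDigest sequential = MessageDigest.getInstance("SHA-256");
            sequential.update(part1);
            sequential.update(part2);
            byte[] a = sequential.digest();

            // One-shot digest over the concatenated byte array
            byte[] concatenated = new byte[part1.length + part2.length];
            System.arraycopy(part1, 0, concatenated, 0, part1.length);
            System.arraycopy(part2, 0, concatenated, part1.length, part2.length);
            byte[] b = MessageDigest.getInstance("SHA-256").digest(concatenated);

            System.out.println(Arrays.equals(a, b)); // true
        }
    }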

Combining hash values hierarchically

We use a hierarchical approach to combine the hash values of several parts of the ADF file.

In ADF, we use this mechanism at several levels to minimize computation costs.

For the sake of brevity, we use a fictitious digest algorithm, SHA-32, for examples, where we just take the first 4 bytes of the SHA-256 algorithm’s result.
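
The following Java sketch shows only the combination mechanism, using the fictitious SHA-32; which values are fed into each level of the hierarchy is defined in the sections below.

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;

    public final class HierarchicalHashExample { // illustrative only

        // Fictitious SHA-32: the first 4 bytes of the SHA-256 result.
        static byte[] sha32(byte[] data) throws NoSuchAlgorithmException {
            return Arrays.copyOf(MessageDigest.getInstance("SHA-256").digest(data), 4);
        }

        // The parent hash is computed over the children's hash values,
        // not over the children's raw data.
        static byte[] combine(byte[]... childHashes) throws NoSuchAlgorithmException {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (byte[] childHash : childHashes) {
                digest.update(childHash);
            }
            return Arrays.copyOf(digest.digest(), 4);
        }
    }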

Configuration

In an ADF file's meta model, the file itself is represented by a resource ?F, which is of type hdf:File. If hashing is enabled, the meta-data model will contain the statement

      ?F adf-audit:hasDigestMethod  [
         a adf-audit:DigestMethod;
         adf-audit:hasCanonicalizationAlgorithm adf-audit:c14n-adf-hdf-1.0 ;
         adf-audit:hasDigestAlgorithm lc-hash:sha256
      ]
    
where adf-audit:c14n-adf-hdf-1.0 refers to the chosen algorithm and lc-hash:sha256 to the message digest algorithm used to calculate a digest from a byte stream.
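
This specification does not prescribe an API for reading the configuration. As an illustration only, the following sketch uses Apache Jena to extract the two algorithm references from a meta-data model that already contains the statement above; the class name and the way the model and the ?F resource are obtained are assumptions, and only the property URIs are taken from this document.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;

    public final class DigestMethodConfigReader { // hypothetical helper

        private static final String ADF_AUDIT = "http://purl.allotrope.org/ontologies/audit#";

        // "model" is the meta-data model, "file" the resource ?F of type hdf:File.
        static void printDigestMethod(Model model, Resource file) {
            Property hasDigestMethod = model.createProperty(ADF_AUDIT, "hasDigestMethod");
            Property hasC14nAlgorithm = model.createProperty(ADF_AUDIT, "hasCanonicalizationAlgorithm");
            Property hasDigestAlgorithm = model.createProperty(ADF_AUDIT, "hasDigestAlgorithm");

            Resource method = file.getPropertyResourceValue(hasDigestMethod);
            if (method != null) {
                System.out.println("Canonicalization: " + method.getPropertyResourceValue(hasC14nAlgorithm));
                System.out.println("Digest algorithm: " + method.getPropertyResourceValue(hasDigestAlgorithm));
            }
        }
    }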

Currently, there are two implementations of an ADF hash algorithm, one of which is only available for read-only access to older files to support backward compatibility. The only choice currently available when initializing check sums for a file is presented below. The configuration will allow other implementations to be used in the future.

Valid options for the message digest algorithm are the supported algorithms listed above (MD2, MD5, SHA-1, SHA-256, SHA-384, and SHA-512).

The file's hash value is stored in the HDF5 root group's attribute ADF_CHECKSUM, encoded as a hexadecimal string.
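
A minimal sketch of such a hexadecimal encoding is shown below; the method name is hypothetical, and lower-case digits are assumed here, matching the byte notation used in this document.

    public final class ChecksumHexExample { // hypothetical helper

        // Encodes a digest as a hexadecimal string, two digits per byte.
        static String toHexString(byte[] checksum) {
            StringBuilder sb = new StringBuilder(checksum.length * 2);
            for (byte b : checksum) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }
    }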

Specification of the algorithm ADF-HDF-2.0

The algorithm's URI is adf-audit:c14n-adf-hdf-2.0. This is the current default implementation for the check sums.

The algorithm is specific to the usage of HDF5 as the underlying file format and internal representation of the data structure. In particular, the serialization of the data description is not deterministic in the sense that the same content of the data description might result in different binary representations in the file, depending on the order of additions or removals.

The computation of the hash value is based on the structure of the ADF file's HDF5 representation. We define the computation of hashes for HDF5 datasets and HDF5 groups. The overall hash value of the file is defined as the HDF5 root group's hash value.

For each HDF5 group or dataset, the computed check sum is stored in an attribute ADF_CHECKSUM. This makes it possible to re-compute the complete check sum when only a part of the file has been changed. During verification, it also makes it possible to detect in which part of the file data has been corrupted.

Computation of HDF5 group hash values

The hash value of an HDF5 group is based on its attributes and its child elements (other groups and datasets). The following values are added to the message digest algorithm (a sketch of the procedure follows the list):

  1. If the group is not the root group, the group's name is added as a UTF-8 string
  2. If the group has any attributes that are not excluded (see below), the UTF-8 string "attributes" is added.
  3. The attributes (except for the excluded ones) are sorted by their name, and each is added to the hash by
    1. Adding its name as a UTF-8 string
    2. Adding its value, depending on its HDF5 type:
      • HDF5 integers (type class H5T.INTEGER whose size is smaller than 4, or whose size is 4 and whose sign is not 0) are encoded as integers.
      • HDF5 integers (type class H5T.INTEGER which do not fulfill the conditions of the previous item and whose size is smaller than 8, or whose size is 8 and whose sign is not 0) are encoded as longs.
      • HDF5 strings (type class H5T.STRING) are encoded as UTF-8 strings.
  4. If the group has any child elements, the UTF-8 string "elements" is added.
  5. The group's child elements are sorted by their name; for each child that is not excluded (see below) we add:
    1. The name of the child as a UTF-8 string
    2. The bytes of the child's hash value. If the child is a group, that value is computed recursively.
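
The following Java sketch summarizes this procedure. The Node and Group interfaces are hypothetical placeholders for an HDF5 binding and not part of ADF; byte conversions follow the rules from the Data Types section, MD5 is used as the default digest algorithm, and the exclusion rules are simplified to a name check.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeMap;

    public final class GroupHashSketch {

        // Hypothetical placeholders for an HDF5 binding.
        interface Node { String name(); byte[] hash() throws Exception; }
        interface Group extends Node {
            boolean isRoot();
            Map<String, Object> attributes(); // attribute name -> Integer, Long or String value
            List<Node> children();            // child groups and datasets
        }

        static final Set<String> EXCLUDED_ATTRIBUTES =
                Set.of("ADF_CHECKSUM", "checksum-adf-hdf-2.0", "adf-hdf-checksum-algorithm");

        static byte[] groupHash(Group group) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");   // default digest algorithm
            if (!group.isRoot()) {
                addString(md, group.name());                       // step 1: group name
            }
            Map<String, Object> attrs = new TreeMap<>(group.attributes());
            attrs.keySet().removeAll(EXCLUDED_ATTRIBUTES);
            if (!attrs.isEmpty()) {
                addString(md, "attributes");                       // step 2: marker string
            }
            for (Map.Entry<String, Object> a : attrs.entrySet()) { // step 3: sorted attributes
                addString(md, a.getKey());
                addValue(md, a.getValue());
            }
            Map<String, Node> children = new TreeMap<>();
            for (Node child : group.children()) {
                // skip the excluded /check-sums group (directly under the root), simplified to a name check
                if (!"check-sums".equals(child.name())) {
                    children.put(child.name(), child);
                }
            }
            if (!children.isEmpty()) {
                addString(md, "elements");                         // step 4: marker string
            }
            for (Map.Entry<String, Node> c : children.entrySet()) { // step 5: sorted children
                addString(md, c.getKey());
                md.update(c.getValue().hash());                    // recursive for child groups
            }
            return md.digest();
        }

        // Character count as a big-endian int, followed by the UTF-8 bytes.
        static void addString(MessageDigest md, String s) {
            md.update(ByteBuffer.allocate(4).putInt(s.length()).array());
            md.update(s.getBytes(StandardCharsets.UTF_8));
        }

        // Attribute values are added as int, long or UTF-8 string (see above).
        static void addValue(MessageDigest md, Object value) {
            if (value instanceof Integer i) {
                md.update(ByteBuffer.allocate(4).putInt(i).array());
            } else if (value instanceof Long l) {
                md.update(ByteBuffer.allocate(8).putLong(l).array());
            } else {
                addString(md, value.toString());
            }
        }
    }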

Excluded attributes

Attributes with the following names are not included in any hash:
  1. ADF_CHECKSUM
  2. checksum-adf-hdf-2.0
  3. adf-hdf-checksum-algorithm

Excluded groups

The group /check-sums and any sub-group of it will not be included in the computation of the hash value.

Computation of HDF5 dataset hash values

For every non-scalar data set, a check sum data set with the same number of dimensions is generated. It has the same path and name as the original dataset, but is located under the HDF5 group /check-sums. A scalar dataset is one without any dimensions; it contains just a single value.

For every dimension i, a number hashblock_i is chosen that describes the number of elements that are grouped together in that direction to compute a single hash value.

We denote the size of the data set in dimension i by size_i.

In the hash data set, we store a check sum for each such block of elements. Each dimension i of the hash data set is of size hashsize_i = ceil(size_i / hashblock_i), i.e. the quotient size_i / hashblock_i rounded up.

How hashblock_i is chosen is up to the implementation. The chosen values are stored in the HDF5 attribute hash_block_size as a string containing the comma-separated block sizes.
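
As a small illustration, the following hypothetical helper computes the hash data set dimensions by integer ceiling division and formats the hash_block_size attribute value.

    public final class HashBlockSketch { // hypothetical helper

        // hashsize_i = ceil(size_i / hashblock_i) for each dimension i
        static long[] hashSizes(long[] sizes, long[] hashBlocks) {
            long[] hashSizes = new long[sizes.length];
            for (int i = 0; i < sizes.length; i++) {
                hashSizes[i] = (sizes[i] + hashBlocks[i] - 1) / hashBlocks[i]; // integer ceiling
            }
            return hashSizes;
        }

        // Comma-separated block sizes, e.g. "1000,500"
        static String hashBlockSizeAttribute(long[] hashBlocks) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < hashBlocks.length; i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(hashBlocks[i]);
            }
            return sb.toString();
        }
    }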

Each hash value of a block is the result of computing the message digest of the following values:

The data set used to store the check sums is of data type byte; the last dimension is multiplied by the length of the resulting digest (e.g. 32 for SHA-256) to store all bytes of the hash in the data set.

The overall check sum for the data set is computed by adding the following data to the message digest algorithm (a sketch follows the list):

  1. For each dimension, the size of the hash digest is added (as long)
  2. We iterate over all coordinates of hashes, starting with the last dimension. For each coordinate, we add the hash contained in the hash data set to the message digest. In the case of a scalar dataset, there is just one block with one value, which is taken directly from the source dataset because no hash data set has been created.
  3. The attributes of the original data set are sorted by their name and for each attribute (except those listed above) we add to the digest:
    1. The name of the attribute (a string)
    2. The value of the attribute (depending on the attribute type, a string, an integer, or a long value)
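
A sketch of this combination step for a two-dimensional, non-scalar data set is shown below, reusing the helpers from the group-hash sketch above. The HashDataset interface and its hashAt accessor, the reading of step 1 as the hash data set's size in each dimension, and the iteration order (last dimension varying fastest) are assumptions made for illustration.

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;
    import java.util.Map;
    import java.util.TreeMap;

    public final class DatasetHashSketch { // illustrative only

        // Hypothetical accessor for the stored block hashes of a 2-dimensional data set.
        interface HashDataset {
            long[] dimensions();               // sizes of the hash data set
            byte[] hashAt(long row, long col); // stored block hash at the given coordinate
        }

        static byte[] overallChecksum(HashDataset hashes, Map<String, Object> attributes)
                throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5"); // default digest algorithm
            long[] dims = hashes.dimensions();
            for (long dim : dims) {                              // 1. dimension sizes as long
                md.update(ByteBuffer.allocate(8).putLong(dim).array());
            }
            for (long row = 0; row < dims[0]; row++) {           // 2. all block hashes
                for (long col = 0; col < dims[1]; col++) {       //    last dimension varies fastest
                    md.update(hashes.hashAt(row, col));
                }
            }
            // 3. sorted, non-excluded attributes of the original data set
            Map<String, Object> attrs = new TreeMap<>(attributes);
            attrs.keySet().removeAll(GroupHashSketch.EXCLUDED_ATTRIBUTES);
            for (Map.Entry<String, Object> a : attrs.entrySet()) {
                GroupHashSketch.addString(md, a.getKey());
                GroupHashSketch.addValue(md, a.getValue());
            }
            return md.digest();
        }
    }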

Specification of the algorithm ADF-HDF-1.0

The algorithm's URI is adf-audit:c14n-adf-hdf-1.0.

Please note: This algorithm is deprecated and only documented here for reference to support older ADF files. Current implementations do not support writing files with it and future versions might drop support completely.

The algorithm is specific to the usage of HDF5 as the underlying file format and internal representation of the data structure. In particular, the serialization of the data description is not deterministic, in the sense that the same content of the data description might result in different binary representations in the file, depending on the order of additions or removals.

To compute the overall hash code, a hash code is computed for each separate part of the ADF file.

The hash value of each part is added to the input of the message digest that computes the overall check sum.

Computation of the data cube part

The overall hash for the data cubes is computed by adding

for each data cube to the message digest algorithm.

Information on the hash is stored in the meta-data for the cube ?Cube with

            ?Cube adf-audit:hasDigest  [
               adf-audit:hasDigestMethod lc-hash:sha256;
               adf-audit:digestValue "..."^^xsd:base64Binary
            ]
          
where "..." contains the base-64-encoded hash value of the data cube.

Computation of single data cube's hash

To compute the hash of a data cube, we first determine from the meta-data the measure datasets and scale datasets that are used to store the content of the data cube. The RDF resources representing the data sets are sorted alphabetically by their URI. The hash value of each data set is added to the message digest. How the hash value of a data set is computed is described below.

If the data cube contains strings or IRIs in its measurements or scales, the cube contains a dictionary with a representation of those strings. If a dictionary is present for the data cube, all data sets of the dictionary are sorted alphabetically by the URI of the RDF resources that represent them. For these datasets, additional attributes of the hash data set are included in the hash value (see below).

Computation of the data package part

For computing the hash value of the data package, we only consider the data sets of the files, not the directory layout and other meta-data like time of last change. The directory layout and meta-data are stored in the technical model of the ADF file, whose hash computation is described below.

Each file in the data description is backed by an HDF5 dataset. First, we take the RDF representation of all files (resources whose type is adf-dp:File) and sort them by their URI. For every file, we add

Computation of the data description part

Several HDF5 datasets are used to store the information. Their hash values are added to the message digest for the computation of the overall data description hash. The following list contains the data sets whose hash values are added, with attribute names that must be included in the hash (see below).

Path               Attributes     Include size attribute of group
dictionary/bytes   nextId         yes
dictionary/keys    nextId         no
dictionary/nodes   nextId         no
nodes_GSPO/nodes   nextId         yes
nodes_GPOS/nodes   nextId         yes
nodes_GOSP/nodes   nextId         yes
nodes_SPOG/nodes   nextId         yes
nodes_POSG/nodes   nextId         yes
nodes_OSPG/nodes   nextId         yes
quads              nextId, size   no

The path given in the table is relative to the HDF5 group /data-description, e.g. /data-description/dictionary/bytes.

For each data set in the list, the following data is added to the message digest:

The hash values of the data sets are not stored in the data description itself. They are stored as a hexadecimal string in the HDF5 attribute "CHECKSUM" instead.

Computation of the audit part

Like the data description, the audit store contains RDF graphs. To compute the audit's check sum, we first add the check sums of the dictionary's datasets analogously to the data description, except that the path is relative to /audit-trail.

Then, all data sets that are included in any audit trail are sorted alphabetically by their URI and for each data set

Computing Check Sums of Data Sets

In the current version of ADF, the file format HDF5 is used as the underlying storage solution. Most of the information in an ADF file is stored in HDF5 datasets. This section describes how the check sum of a data set is computed.

Because data sets can contain large amounts of data, we again apply the approach of combining multiple hashes into one overall hash.

A data set has n dimensions and a data type. Each entry of the data set has that data type. In ADF, the data types byte, short, int, long, float, and double can be used.

When we use coordinates of elements of a data set in the following, we start with 0 (not 1).

Constructing a Data Set for Check Sums

For this version of the algorithm, check sums data sets are created according to the section on the more recent algorithm, with a few deviations:
  • The path to the check sum data set is stored in the RDF metadata as described below.
  • The hash block size is stored in the RDF metadata instead of an attribute as described below.
  • Only certain attributes are added to the hash, as defined in the section about hashing the data description above.

Meta-information for the stored datasets

Information about data sets that store hash values is stored as RDF data in the technical data description. Assume that a data set is represented by the resource adf://some_dataset. A resource representing the hash data set looks like the following:
        <adf://some_arbitrary_uri> a adf-hash:HashDataSet;
          adf-audit:hasDigest [
              a adf-audit:Digest;
              adf-hash:checksumDatasetOf <adf://some_dataset>;
              adf-audit:hasDigestMethod [
                  a adf-audit:DigestMethod;
                  adf-hash:hashDimension [
                      a adf-hash:HashDimensionDescription;
                      hdf:order 1 ;
                      adf-hash:hashBlockSize 1000;
                  ];
                  adf-hash:hashDimension [
                      a adf-hash:HashDimensionDescription;
                      hdf:order 2;
                      adf-hash:hashBlockSize 500;
                  ];
                  adf-audit:hasDigestAlgorithm lc-hash:sha256;
              ]
          ]
      
The check sum data set shown has two dimensions; the number of elements hashed per block is 1000 in the first dimension and 500 in the second. The digest algorithm used to aggregate the data, SHA-256, is also specified.

Change History

Version Release Date Remarks
1.2.0 2016-12-07
  • Initial version
1.3.0 Preview 2017-03-31
  • Updated versions and dates
1.3.0 RC 2017-05-05
  • Updated versions and dates
  • Updated standard algorithm
1.3.0 RF 2017-06-30
  • Updated versions and dates
  • Updated document to describe changes in ADF 1.3.0 RF
1.4.2 2018-01-25
  • Updated versions and dates
  • Updated document to describe changes in ADF 1.4.2
1.4.3 RC 2018-10-11
  • Updated versions and dates
1.4.5 RF 2018-12-17
  • Updated versions and dates
1.5.0 RC 2019-12-12
  • Updated versions and dates
1.5.0 RF 2020-03-03
  • Updated HDF5 reference link
1.5.3 RF 2020-11-30
  • Updated broken reference links
  • Updated PURL and DOCS server links to relative links
  • Reformat the document header